Let us say you want to know whether a random day of the year is a winter day, whether it is a snow day, and what the relationship between winter and snow days is.
We start by drawing a circle, which symbolizes all days in a year:
This circle captures all 4 possible day types: winter days with and without snow and days in the other seasons with and without snow. The probability that a day belongs to any of these day types is 1. Therefore, we say that the area of the circle is also 1.
We know that a quarter of the days in a year are in winter months. We visualize this by highlighting a section of the circle in icy blue.
From this we can see that the probability of winter is 1/4, or 0.25. This is an unconditional (or marginal) probability, and we can write \(P(winter)\) = 0.25.
Snowy days are usually in winter. Therefore, we draw an ellipse for snow days that largely overlaps with the quadrant for winter days.
The unconditional (or marginal) probability of snow, \(P(snow)\), is the probability of snow regardless of whether it is winter or not, and corresponds to the area of the grey ellipse: \(P(snow) =\) 0.14.
The joint probability of winter and snow \(P(winter, snow)\) or \(P(winter \textrm{ and } snow)\) is the ellipse-area that overlaps with the bottom right quadrant, which is shown in the next figure.
A conditional probability gives the answer to the question “Given that it is winter, what is the probability of snow”. This amounts to the question “What is the size of the ellipse-area that overlaps with the bottom right quadrant, relative to the area of the bottom right quadrant”. The next figure visualizes this.
To obtain a conditional probability, we are “conditioning” our question about the probability of snow on the value of another variable, namely that the season is winter. In comparison, the joint probability asks “What is the size of the trimmed ellipse-area, relative to the total circle area of 1”.
The following figures show fractions that make even clearer that the joint probability is the probability of snowy winter days divided by the probability of any day (which is just one), and that the conditional probability is the probability of snowy winter days divided by the probability of winter days.
Note that because the probability of a winter day is smaller than 1 and dividing by a number smaller than one makes a number larger, dividing by the marginal probability of winter ensures that the conditional probability is larger than the joint probability. This makes sense, as snow in winter has to be more likely than winter-snow in the whole year.
So, to calculate the conditional probability \(P(snow|winter)\) we want to answer the question “What is the size of the ellipse-area that overlaps with the bottom right quadrant, relative to the area of the bottom right quadrant”. Unfortunately, there is no simple equation for calculating the overlap of an ellipse and a circle quadrant. But we can approximate the correct answer by randomly placing points into the circle and checking if they are in the bottom right quadrant (winter), in the ellipse (snow), or in the area of the ellipse that overlaps with the bottom right quadrant (winter and snow).
Here is the code for the simulation. You don’t need to understand all of it; the important part is that we are keeping track of winter and snow days in the vectors named is.winter and is.snow, respectively.
set.seed(123)
# draw the background
draw_pie()
draw_ellipse()
N = 365
is.winter = vector(length = N) # vector to count winter days
is.snow = vector(length = N)   # vector to count snow days
for (k in 1:N) {
  # generate random point with custom function
  xy = rpoint_in_circle()
  # check if it is a snow day, i.e. in ellipse, with custom function
  is.snow[k] = in_ellipse(xy, h.e, k.e, a.e, b.e, e.rot)
  # check if it is a winter day
  is.winter[k] = xy[1] > 0 & xy[2] < 0
  # plot points
  points(xy[1], xy[2],
         pch = ifelse(is.snow[k], 8, 21), cex = .75,
         bg = ifelse(is.winter[k], "blue", "red"),
         col = ifelse(is.winter[k], "blue", "red"))
}
legend(.75, .8,
       pch = c(8, 21, 15, 15), bty = "n",
       col = c("black", "black", "blue", "red"),
       legend = c("snow", "no snow", "winter", "no winter"))
Let’s first calculate the probability of winter, which should be around 0.25. This is simply the number of blue dots divided by the total number of dots.
N_winter = sum(is.winter)
P_winter = N_winter/N
P_winter %>% round(2)
## [1] 0.25
Now, the probability of snow (star-shaped dots divided by total number of dots):
N_snow = sum(is.snow)
P_snow = N_snow/N
P_snow %>% round(2)
## [1] 0.15
And now it gets interesting. For the joint probability of winter and snow \(P(winter, snow)\) we count the number of blue stars.
# logical indexing:
# is.snow[is.winter] returns only those entries of the
# vector is.snow that are at positions where the
# value for is.winter is TRUE
N_winter_and_snow = sum(is.snow[is.winter])
P_winter_and_snow = N_winter_and_snow/N
P_winter_and_snow %>% round(2)
## [1] 0.12
Check the last code block and see that to get the joint probability
we divide by the total number of dots N.
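The logical indexing used above can be shown with a minimal, made-up example (the vectors x and keep are invented just for illustration):

```r
# Minimal illustration of logical indexing (x and keep are made up)
x <- c(10, 20, 30)
keep <- c(TRUE, FALSE, TRUE)
x[keep]    # returns 10 30
sum(keep)  # returns 2, since TRUE counts as 1 and FALSE as 0
```

This is also why sum(is.winter) counts the winter days: summing a logical vector counts its TRUE entries.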
In contrast, for conditional probabilities, we want to divide by the number of dots that have the value we are conditioning on. If we want to calculate the conditional probability \(P(snow | winter)\), we therefore have to divide by the number of winter dots:
P_snow_given_winter = N_winter_and_snow/N_winter
P_snow_given_winter %>% round(2)
## [1] 0.48
If you check further above, you can see that P_winter_and_snow and P_winter are calculated by dividing N_winter_and_snow and N_winter, respectively, by N. Therefore, N_winter_and_snow/N_winter and P_winter_and_snow/P_winter give the same result and we can also write:
P_snow_given_winter = P_winter_and_snow/P_winter
Hopefully, you now recognize that we have the conditional probability on the left side (P_snow_given_winter or \(P(snow|winter)\)), which we calculate with the help of the joint probability (P_winter_and_snow or \(P(snow, winter)\)) and the unconditional (marginal) probability (P_winter or \(P(winter)\)) on the right side:
\[ \overset{\color{violet}{\text{conditional probability}}}{P(snow|winter)} = \frac{\overset{\color{red}{\text{joint probability}}}{P(snow, winter)}}{\overset{\color{blue}{\text{marginal probability}}}{P(winter)}} \]
or, more abstractly:
\[ \overset{\color{violet}{\text{conditional probability}}}{P(A|B)} = \frac{\overset{\color{red}{\text{joint probability}}}{P(A, B)}}{\overset{\color{blue}{\text{marginal probability}}}{P(B)}} \] and by multiplying with \(P(B)\) on both sides, we get first
\[ \overset{\color{violet}{\text{conditional probability}}}{P(A|B)} \cdot \overset{\color{blue}{\text{marginal probability}}}{P(B)} = \overset{\color{red}{\text{joint probability}}}{P(A,B)} \]
which is the same as
\[ \overset{\color{red}{\text{joint probability}}}{P(A,B)} = \overset{\color{violet}{\text{conditional probability}}}{P(A|B)} \cdot \overset{\color{blue}{\text{marginal probability}}}{P(B)} \]
This is the general product rule (or chain rule) that connects conditional probabilities with joint probabilities.
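As a quick sanity check, we can plug in the (rounded) values from the simulation above:

```r
# Product rule check with rounded values from the simulation:
# P(winter, snow) = P(snow | winter) * P(winter)
P_winter <- 0.25
P_snow_given_winter <- 0.48
P_snow_given_winter * P_winter  # returns 0.12
```

Multiplying the conditional probability by the marginal recovers the joint probability we computed directly from the blue stars.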
Exercise 2E3, which asks which expressions are consistent with “probability of Monday given that it is raining”, can be used to derive Bayes rule.
The correct answers are:
1: P(Monday | rain)
4: P(rain | Monday) * P(Monday) / P(rain)
The question is then whether we can show that \(P(Monday|rain) = P(rain|Monday) \cdot P(Monday) / P(rain)\).
The key to the solution is that we can use both \(P(Monday|rain)\) and \(P(rain|Monday)\) to calculate the same thing: the joint probability \(P(Monday, rain)\).
Page 37 shows the relationship of joint and conditional probability:
\[ P(A,B) = \color{blue}{P(A|B) \cdot P(B)} \\ P(A,B) = \color{red}{P(B|A) \cdot P(A)} \]
Therefore, we can say that the joint probability that it is Monday and raining is
\[P(Monday,rain) = \color{blue}{P(rain|Monday) \cdot P(Monday)}\]
or
\[P(Monday,rain) = \color{red}{P(Monday|rain) \cdot P(rain)}\] and we can further say
\[ \color{red}{P(Monday|rain) \cdot P(rain)} = \color{blue}{P(rain|Monday) \cdot P(Monday)} \]
If we want to know what P(Monday|rain) is, we now have to divide both sides by P(rain), which gives us
\[ \color{red}{P(Monday|rain)} = \frac{\color{blue}{P(rain|Monday) \cdot P(Monday)}}{\color{red}{P(rain)}} \]
where the right hand side is answer 4 from above.
Maybe the last equation looks familiar. To make it a bit more recognizable, we can replace Monday with \(A\) and \(rain\) with \(B\):
\[ P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)} \] This is Bayes Rule, which one uses to calculate the inverse conditional probability, i.e. when we have information about the probability of \(B\) given \(A\) and want to calculate the probability of \(A\) given \(B\).
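We can apply Bayes rule directly to the rounded values from the winter/snow simulation above, inverting \(P(snow|winter)\) to obtain \(P(winter|snow)\):

```r
# Inverting a conditional probability with Bayes rule, using the
# rounded simulation values:
# P(winter | snow) = P(snow | winter) * P(winter) / P(snow)
P_winter <- 0.25
P_snow <- 0.15
P_snow_given_winter <- 0.48
P_snow_given_winter * P_winter / P_snow  # returns 0.8
```

Given that a day is snowy, it is very likely (0.8) a winter day, even though the reverse conditional probability is only 0.48.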
A probability distribution is a function, that is, an object that receives input and returns an output. This is very general, and so we can have functions that always return the same value (\(f(x) = 0\)), that square a value (\(f(x) = x^2\)), or that check whether a certain condition is met (\(f(x) = 1(x == 5)\)).
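The indicator function just mentioned can be sketched as a small R function (the name f is made up for this example):

```r
# The indicator function f(x) = 1(x == 5): returns 1 if the condition
# is met and 0 otherwise (the name f is made up for this example)
f <- function(x) as.numeric(x == 5)
f(5)  # returns 1
f(3)  # returns 0
```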
In the context of Bayesian statistics, we use probability distributions to describe states of the world about which we are uncertain.
For such a function, we first need to be explicit about what we are uncertain about. Continuing with the globe tossing example, we can say that we are uncertain about whether we will catch the globe with the tip of the index finger on water.
If we were sure to land on water, we would say \(P(water) = 1\), and if we were sure to land on land, we would say \(P(water) = 0\). But because we are uncertain, anything between 0 and 1 is possible. Therefore, our function to describe this uncertainty should allow values between 0 and 1. We can start drawing the function by just specifying an x axis that goes from 0 to 1.
Before we display uncertainty, let’s look at how this function looks if we are certain to land on water:
par(mar = c(3, 2, 0, 1))
plot(0, type = "n",
     ylim = c(0, 1.01),
     xlim = c(0, 1),
     ylab = "", xlab = "p",
     bty = "n")
arrows(1, 0, 1, 1, col = "blue", length = .1)
lines(c(0, 1), c(0, 0), col = "blue")
If we are certain to land on water, \(P(water) = 1\) and all other probabilities have the value zero.
If we want to express uncertainty, we also have to allow for all other values. If we had no information whatsoever about the probability to land on water, all probabilities should get the same value.
dbeta(x = 0.5, shape1 = 1, shape2 = 1) returns the density
for the value 0.5 under the beta distribution where the parameters
shape1 and shape2 have the value 1.
dnorm(x = 0.5, mean = 1, sd = 1) returns the density for
the value 0.5 under the normal distribution with a mean of 1 and a sd of
1.
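A minimal check of this “same value everywhere” idea, using the dbeta function from the side note:

```r
# With shape1 = 1 and shape2 = 1, the beta density is 1 everywhere
# between 0 and 1, so every probability value is equally plausible
dbeta(c(0.1, 0.5, 0.9), shape1 = 1, shape2 = 1)  # 1 1 1
```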
For this function to be a probability distribution, the area under the function (the integral) must equal 1.
To see this, we can observe in the next plot that the area under the probability function remains constant while we go from believing weakly (left) to more strongly (right) that the probability to land on water is larger than 0.5.
rnorm(n = 1000, mean = 1, sd = 1)
returns 1000 random numbers from a normal distribution with the same
parameters.
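We can verify numerically that the area stays 1 for differently concentrated beta distributions (the shape values below are arbitrary examples, not the ones used in the figure):

```r
# The area under a beta density is always 1, however concentrated
# the distribution is (shape values are arbitrary examples)
integrate(dbeta, 0, 1, shape1 = 2, shape2 = 2)$value   # ~1
integrate(dbeta, 0, 1, shape1 = 20, shape2 = 5)$value  # ~1
```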
To summarize, one can think of a probability distribution as a function that expresses how likely different values of a parameter (here p) are and whose area under the curve (or integral) is 1.
Depending on the nature of a parameter, different probability distributions must be used. Above, we use the so-called beta distribution, because it allows values between 0 and 1, which matches the fact that probabilities must be between 0 and 1. For other phenomena, different distributions can be used. For instance, we might use a normal distribution to characterize our uncertainty about tomorrow’s temperature, or a Poisson distribution to characterize uncertainty about things we count, such as the number of shoes a person has.
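A small sketch of this matching of distribution to parameter type, with arbitrary example values:

```r
# Matching a distribution to the type of quantity (example values only)
dbeta(0.5, shape1 = 2, shape2 = 2)  # a probability: support is [0, 1]
dnorm(21, mean = 20, sd = 5)        # a temperature: any real number
dpois(3, lambda = 2)                # a count: 0, 1, 2, ...
```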
In Bayesian statistics, we use such distributions to express three things: the prior, the likelihood, and the posterior.
Let’s walk through a simple example. We start by describing our prior judgement, that we are slightly confident that the index finger touches water rather than land, with a beta distribution:
dbeta function for the prior
Next, the likelihood. For the globe tossing example, we can think of each toss as a trial and of each landing of the index finger on water as a success. The distribution that gives the likelihood of the data for different success probabilities \(p\), given a number of trials and successes, is the binomial distribution. So we use this distribution to get the likelihood function. Let us assume we had 4 trials and 3 successes.
dbinom function.
(dbinom(x = 3, size = 4, prob = p), where p is
a vector with parameter values.)
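Evaluated on a grid of values for \(p\), this likelihood peaks at \(p = 3/4\), the observed proportion of successes:

```r
# Binomial likelihood of 3 successes in 4 trials, over a grid for p
p <- seq(0, 1, by = 0.01)
lik <- dbinom(x = 3, size = 4, prob = p)
p[which.max(lik)]  # returns 0.75, the observed proportion 3/4
```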
Now let us re-introduce Bayes rule, which we described above as:
\[ P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)} \]
If we just replace \(A\) with \(parameter\) and \(B\) with \(data\) and annotate the different terms we get
\[ \overset{\color{violet}{\text{posterior probability}}}{P(parameter|data)} = \frac{\overset{\color{red}{\text{likelihood}}}{P(data|parameter)} \cdot \overset{\color{blue}{\text{prior probability}}}{P(parameter)}}{\overset{\color{orange}{\text{evidence}}}{P(data)}} \]
This shows us that if we just multiply the likelihood with the prior, two things we calculated and plotted above, we get something that is proportional to the posterior probability of the parameter (the probability to land with the index finger on water) given the data. This is what is meant if you see this expression:
\[ posterior \propto likelihood \cdot prior \]
Let us just calculate and plot this:
prior_x_likelihood = prior * likelihood
This distribution is only proportional to the posterior distribution, because the product of prior and likelihood does not sum up to 1. We can calculate the posterior probability distribution by dividing by the sum.
s = sum(prior_x_likelihood)
posterior = prior_x_likelihood/s
c(1/s, sum(posterior))
## [1] 77.34548 1.00000
The following figure illustrates how we get from the un-normalized posterior to the normalized posterior, which sums to 1, by multiplying with a constant, which is just 1/sum(prior_x_likelihood).
The next plot shows the posterior distribution together with the prior distribution and the likelihood. We are also adding a plot for the un-normalized posterior with a dotted line.
The figure shows that the posterior is a compromise between the prior distribution and the likelihood. That is, it is a compromise between our information before we saw the data and the information that is in the data.
Because we had relatively little data compared to the information in the prior, we can still clearly see the influence of the prior in the posterior. However, if we collect five times as much data, we become more certain (the posterior distribution is narrower) and the influence of the prior diminishes, so that the posterior will be very similar to the likelihood:
So it is not so easy to “cheat” with priors to get what one wants, provided one has collected sufficient data of course.
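A sketch of this effect, assuming for illustration a Beta(2, 2) prior (the actual prior parameters used above are not shown in the text):

```r
# Assumed prior for illustration: Beta(2, 2); the original 4-trial data
# (3 successes) versus five times as much data (15 successes in 20 trials)
p <- seq(0, 1, by = 0.001)
prior <- dbeta(p, 2, 2)

post_small <- prior * dbinom(3, 4, p)
post_small <- post_small / sum(post_small)

post_big <- prior * dbinom(15, 20, p)
post_big <- post_big / sum(post_big)

# standard deviation of a distribution defined on the grid
sd_of <- function(post) {
  m <- sum(post * p)
  sqrt(sum(post * (p - m)^2))
}

sd_of(post_small)  # wider posterior with little data
sd_of(post_big)    # narrower posterior with five times the data
```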
What is the posterior probability of water, given 10 W and 3 L tosses?
First we define a grid and plot the prior probability values:
p_grid = seq(0, 1, by = .05)
prior = dbeta(p_grid, 2, 1)
plot(p_grid, prior, type = "h", col = "blue",
     ylab = "density", main = "Prior")
Vertical lines indicate the prior plausibility for parameter values in the grid.
Next we calculate the likelihood, i.e. the probability of the data given the model (a binomial distribution), the data (10 W, 3 L) and the parameter p (in p_grid):
likelihood = dbinom(10, 13, p_grid)
plot(p_grid, likelihood, type = "h", col = "red",
     ylab = "density", main = "Likelihood")
Vertical lines indicate the plausibility of the data for the parameter values in the grid.
And now we can calculate the un-normalized posterior as a product of prior and likelihood:
posterior = prior * likelihood
plot(p_grid, posterior, type = "h", col = "violet",
     ylab = "density", main = "Posterior")
Vertical lines indicate the posterior plausibility of the parameter values in the grid, given the data and the prior.
Here is a plot with all three together:
par(mar = c(5.1, 4.1, 1, 2.1))
ylim = c(0, max(c(likelihood, posterior, prior)))
plot(p_grid, prior, type = "h", col = "blue",
     ylab = "density", ylim = ylim)
lines(p_grid + .005, likelihood, type = "h", col = "red")
lines(p_grid - .005, posterior, type = "h", col = "violet")
legend("topleft", col = c("blue", "red", "violet"),
       lty = 1, legend = c("Prior", "Likelihood", "Posterior"),
       bty = "n")
We can make the plot easier to view by normalizing all values so that they sum up to 1 for prior, likelihood and posterior. In each distribution, only the relative values at the different points of the grid are relevant!
n_prior = prior / sum(prior)
n_likelihood = likelihood / sum(likelihood)
n_posterior = posterior / sum(posterior)
n_ylim = c(0, max(c(n_likelihood, n_posterior, n_prior)))
par(mar = c(5.1, 4.1, 1, 2.1))
plot(p_grid, n_prior, type = "h", col = "blue",
     ylab = "normalized density", ylim = n_ylim)
lines(p_grid + .005, n_likelihood, type = "h", col = "red")
lines(p_grid - .005, n_posterior, type = "h", col = "violet")
legend("topleft", col = c("blue", "red", "violet"),
       lty = 1, legend = c("Prior", "Likelihood", "Posterior"),
       bty = "n")
This plot helps to understand that for a grid search we take each point of the grid, evaluate the prior and the likelihood at that point, and multiply the two to obtain the (un-normalized) posterior at that point.
However, usually one would not show distributions with vertical lines. Instead, one just shows their outlines:
par(mar = c(5.1, 4.1, 1, 2.1))
plot(p_grid, n_prior, col = "blue", type = "l",
     ylab = "normalized density", ylim = n_ylim)
lines(p_grid + .005, n_likelihood, col = "red")
lines(p_grid - .005, n_posterior, col = "violet")
legend("topleft", col = c("blue", "red", "violet"),
       lty = 1, legend = c("Prior", "Likelihood", "Posterior"),
       bty = "n")
This distribution is not very smooth. We can make it smoother simply by using a finer grid:
fp_grid = seq(0, 1, by = .001)
f_prior = dbeta(fp_grid, 2, 1)
f_likelihood = dbinom(10, 13, fp_grid)
f_posterior = f_prior * f_likelihood
f_prior = f_prior / sum(f_prior)
f_posterior = f_posterior / sum(f_posterior)
f_likelihood = f_likelihood / sum(f_likelihood)
f_ylim = c(0, max(c(f_likelihood, f_posterior, f_prior)))
par(mar = c(5.1, 4.1, 1, 2.1))
plot(fp_grid, f_prior, type = "l", col = "blue",
     ylab = "normalized density", ylim = f_ylim)
lines(fp_grid + .005, f_likelihood, col = "red")
lines(fp_grid - .005, f_posterior, col = "violet")
legend("topleft", col = c("blue", "red", "violet"),
       lty = 1, legend = c("Prior", "Likelihood", "Posterior"),
       bty = "n")